BioCoder: a benchmark for bioinformatics code generation with large language models

Abstract

Summary: Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance performance on our testing benchmark (by >15% in terms of Pass@K under certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (i) they accommodate a long prompt (>2600 tokens) with full context, including functional dependencies, and (ii) they contain domain-specific knowledge of bioinformatics, beyond just general coding capability. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on our benchmark (50% versus up to 25%).

Availability and implementation: All datasets, the benchmark, Docker images, and scripts required for testing are available at: https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.


Introduction
Large language models (LLMs) have demonstrated great success in code generation (Chen et al. 2021, 2023, Chowdhery et al. 2022, Barke et al. 2023, Li et al. 2023). The landscape of existing coding benchmarks for LLMs is largely populated with simple functions, often limited to a handful of lines (Austin et al. 2021, Chen et al. 2021, Du et al. 2023, Wong et al. 2023). Combined with a significant lack of closed-domain datasets across diverse fields, this landscape highlights the need for a more robust benchmarking system. Although domain-specific datasets, such as DS-1000 (Lai et al. 2022) for data science, have emerged, they fall short of adequately addressing specific tasks in fields like bioinformatics. Open-domain alternatives, including HumanEval (Chen et al. 2021), MBPP (Austin et al. 2021), and APPS (Hendrycks et al. 2021), offer entry-level programming tasks, but their utility is limited as they lack the ability to test more niche, domain-specific code blocks. This shortfall is largely due to a lack of appropriate fine-tuning and context (Muennighoff et al. 2023b). Therefore, a more comprehensive approach to benchmarking is clearly needed.
To address these limitations, we introduce BioCoder (see Fig. 1), a benchmark for code generation incorporating 2269 bioinformatics-specific coding problems. Our BioCoder benchmark mainly targets bioinformatics data analysis, which involves tasks such as managing various biological data formats, understanding processing workflows, and utilizing the APIs of various packages. This domain encapsulates the majority of daily tasks encountered by bioinformaticians in data analysis. However, BioCoder also touches upon aspects of writing bioinformatics software, particularly when tool development intersects with data analysis. Further expanding the scope of BioCoder, we included an additional 253 questions from the Rosalind project. This project specializes in generating Python functions that address key bioinformatics topics such as genetic sequencing and DNA/RNA analysis. BioCoder ensures the inclusion of all potential external packages and code that could be utilized by the generated program. This consideration extends to recognizing that real-world functions often necessitate managing multiple external function calls and using global variables; hence, we included all potentially required class declarations in the input. Lastly, we performed ablation studies to determine whether the models are strictly memorizing the solutions rather than being proficient at generating code (see Supplementary Appendix M).
The key highlights of our work can be outlined as follows: (i) We create a new high-quality dataset for code generation, curated from 1720 bioinformatics repositories referenced in peer-reviewed bioinformatics articles. We processed the data, rephrasing more detailed text descriptions, as well as associated comments and specifications, including considerations needed in coding. (ii) We provide an extendable parsing tool capable of extracting all pertinent information associated with the target function in expansive projects. (iii) We provide a library for code LLMs, similar to Bui et al. (2023), for both training and inference in code generation tasks. (iv) We provide a fuzz-testing tool capable of scaling to handle substantial datasets. Our benchmark results, derived from 1000 iterations, indicate the Pass@K rate.

Related work
BioCoder is a code generation benchmark designed for challenging, practical bioinformatics scenarios, offering an extensible testing framework for evaluating the performance of LLMs. We provide a brief overview of the related work in both code generation models and benchmarks.

Code generation datasets and benchmarks
Early work on code generation benchmarks used lexical exact match, data flow, and abstract syntax tree (AST) methods. However, these measures proved to be unreliable due to their sensitivity to inconsequential differences in the generated code. In response, execution-based evaluation approaches have become more prevalent (Chen et al. 2021, Khlaaf et al. 2022, Lai et al. 2022, Li et al. 2022, Wang et al. 2022b, Athiwaratkun et al. 2023). These approaches execute tests on the generated code to verify its functional correctness, ensuring unbiased evaluations irrespective of implementation method or style variations.
As a result, the field of code generation has seen a burgeoning number of execution-based benchmarks (Table 1) (Lee et al. 2023, Pan et al. 2023, Wong et al. 2023, Yuan et al. 2023, Zan et al. 2023), each presenting unique properties in terms of size, language coverage (Orlanski et al. 2023), complexity (Du et al. 2023, Zhuo 2023), and practical applicability (Yu et al. 2023). For instance, HumanEval (Chen et al. 2021) and MBPP (Austin et al. 2021) are frequently used code generation benchmarks that consist of 164 and 974 simple Python functions, respectively, representing a small sample size. These benchmarks also overlook multi-language coding scenarios, a gap that is partially bridged by benchmarks like HumanEval-X (Zheng et al. 2023) and MCoNaLa (Wang et al. 2023b). For a more comprehensive survey of previous code generation benchmarks, refer to Zan et al. (2023).
However, all datasets discussed above share the same shortcoming of only benchmarking generic functions, rather than domain-specific ones. DS-1000 (Lai et al. 2022) represents a more domain-specific dataset, featuring 1000 data science workflows extracted from Python functions. Li et al. (2023) reported that performance on the HumanEval and MBPP benchmarks does not always align with performance on the DS-1000 benchmark. This discrepancy underscores the need for benchmarks that more accurately emulate real-world, domain-specific code generation.
In addition, the context supplied greatly influences the performance of existing LLMs (Wang et al. 2022a). While DS-1000 includes eight packages, it fails to fully reflect a typical coding environment. This gap is partially bridged through benchmarks such as CoderEval (Yu et al. 2023), which incorporate some dependencies and function calls. However, these benchmarks are rudimentary in nature and consist primarily of domain-agnostic functions. As LLMs continue to evolve, we are now beginning to see repository-level benchmarks that provide a high amount of context, such as RepoBench (Liu et al. 2023). However, these benchmarks remain new and untried.
Our work shares common aspects with CoderEval in its ability to evaluate models beyond the simple generation of standalone functions. Both methodologies employ Docker-based testing to handle the necessity of context-dependent code. However, our approach distinguishes itself from CoderEval by its specific emphasis on bioinformatics. We ensure that each function in our dataset requires a certain level of domain expertise in bioinformatics through a combination of automatic filtering, GPT-assisted filtering, and manual inspection. Furthermore, our dataset surpasses the scale of CoderEval, which consists of 230 functions from 43 Python projects and 230 methods from ten Java projects. In contrast, we source 2522 functions from over 2000 repositories, providing a more extensive and challenging context for code generation tasks. A comprehensive comparison between our benchmark and CoderEval can be found in Supplementary Appendix G.

Initial dataset filtering to a set of 28 repositories
Our dataset originates from an initial web scrape of 1743 bioinformatics-related GitHub repositories (see Fig. 2). Specifically, we utilized the list of 1743 bioinformatics-adjacent repositories from Russell et al. (2018) as the foundation for BioCoder. This list contains a curated selection of 1720 bioinformatics repositories sourced from the literature. The collected repositories include code written in various programming languages such as C, C++, PHP, Python, R, Ruby, SQL, Perl, Java, Matlab, and C#. However, for the scope of this study, we focus on Python and Java, with the intention to expand to other languages in the future. The decision to prioritize Java and Python was based on an empirical investigation into the prevalence of different programming languages across bioinformatics repositories. A more detailed discussion of this language selection process can be found in Supplementary Appendix P.
The repositories were then filtered based on popularity, community ratings, and a manual review process. This resulted in a set of 28 high-quality, highly domain-specific repositories commonly used in the field of bioinformatics. After determining this set of repositories, we developed custom Python and Java parsers to automatically analyze the selected GitHub repositories. These parsers generated an AST for each code file in the repositories and extracted relevant data, including function content, function signatures, important imports, and cross-file dependencies for each function within the code files. Upon parsing all the repositories, we obtained a large set of over 20 000 Python functions and more than 50 000 Java functions. Given this extensive baseline of functions, we conducted two rounds of automatic filtering, resulting in a final count of 1026 Python functions and 1243 Java functions (Table 2).
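To illustrate the parsing step, the following is a minimal sketch of how the Python side of such a parser could be written with the standard `ast` module. The record fields and the handling of imports are simplifications; the actual BioCoder parsers also resolve cross-file dependencies and class context.

```python
import ast
from pathlib import Path

def extract_functions(repo_root: str):
    """Collect function signatures, docstrings, source, and module-level imports
    from every Python file in a repository (simplified sketch)."""
    records = []
    for path in Path(repo_root).rglob("*.py"):
        source = path.read_text(encoding="utf-8", errors="ignore")
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip files that do not parse
        imports = [ast.unparse(node) for node in tree.body
                   if isinstance(node, (ast.Import, ast.ImportFrom))]
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                records.append({
                    "file": str(path),
                    "name": node.name,
                    "signature": f"def {node.name}({ast.unparse(node.args)}):",
                    "docstring": ast.get_docstring(node),
                    "source": ast.get_source_segment(source, node),
                    "imports": imports,
                })
    return records
```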

Topic distribution in the selected repositories
To gain an understanding of the distribution of bioinformatics topics within our set of 28 repositories, we applied latent Dirichlet allocation (LDA) to the abstracts of articles citing each repository. Each of these selected repositories contains the codebase associated with a single bioinformatics journal article. We used LDA to infer topics for the abstracts of articles citing each repository in the main dataset. Specifically, from the LDA model, we identified terms that were primarily associated with a single topic. We chose a model with eight topics due to its maximal coherence of concepts within the top topic-specialized terms. Finally, these eight topics were manually labeled to summarize the top terms, resulting in the following categories: (i) Cancer and epigenetics, (ii) Proteomics and microscopy, (iii) Variant calling, (iv) Genetics and population analysis, (v) Structure and molecular interaction, (vi) Web and graphical applications, (vii) Assembly and sequence analysis, and (viii) Transcription and RNA sequencing. A detailed description of each topic can be found in Supplementary Appendix N. Our function topic filtering process can be found in Supplementary Appendix V.
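As a rough illustration of this step, the sketch below fits an eight-topic LDA model to a list of abstracts and returns the top terms per topic. It uses scikit-learn for concreteness; the paper does not specify the LDA implementation, so the library choice and preprocessing parameters here are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def top_terms_per_topic(abstracts, n_topics=8, n_terms=15):
    """Fit LDA on citing-article abstracts and return the highest-weight terms
    for each topic; topics are then labeled manually, as described above."""
    vectorizer = CountVectorizer(stop_words="english", max_df=0.9, min_df=2)
    counts = vectorizer.fit_transform(abstracts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    vocab = vectorizer.get_feature_names_out()
    return [[vocab[i] for i in weights.argsort()[::-1][:n_terms]]
            for weights in lda.components_]
```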

Filtering the repositories to a list of core functions
To further filter and find a small set of functions, we started with a large baseline of functions, i.e. all the functions in the 28 repositories above, and initiated two rounds of automatic filtering to reduce the manual workload. The first round involved keyword filtering, where each function and its comments required at least 10 matches with bioinformatics-related keywords scraped from Wikipedia articles, as mentioned earlier.
The methodology for obtaining this Wikipedia-based wordlist can be found in Supplementary Appendix V. Subsequently, we performed a second round of filtering, during which the OpenAI GPT-3.5 model assessed the bioinformatics relevance of each function. Finally, we manually sorted through the remaining functions, resulting in 1026 Python functions and 1243 Java functions (see Table 2). The "similar data" set in Table 2 includes an additional 157 Python functions and 50 Java functions, maintaining the same 253 Rosalind function count, reflecting the composition of the public data. These additional functions were selected to closely align with the statistics of the public data, such as the distribution of comment lines and token counts.
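The first, keyword-based round can be sketched as a simple count against the Wikipedia-derived wordlist; the exact tokenization and matching rules in BioCoder may differ, so this is illustrative only.

```python
import re

def passes_keyword_filter(function_source: str, bio_keywords: set, min_hits: int = 10) -> bool:
    """Round 1: keep a function only if its code and comments contain at least
    `min_hits` matches against the bioinformatics wordlist (illustrative sketch)."""
    tokens = re.findall(r"[a-zA-Z_]+", function_source.lower())
    hits = sum(1 for token in tokens if token in bio_keywords)
    return hits >= min_hits
```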
Our function selection process aimed to strike a balance, ensuring that the final dataset comprises truly bioinformatics-focused functions applicable to our study. This filtering process was undertaken by experts with knowledge in bioinformatics, highlighting the essential role of bioinformatics understanding in this work.
Although our benchmark for code generation is general in nature, it is rooted in the context of bioinformatics, utilizing curated and filtered datasets based on bioinformatics problems (see Supplementary Appendix N for more details on the topic modeling and statistics regarding the overall topic coverage of the dataset).While an understanding of bioinformatics and biology may not be essential for using the benchmark, it was built to reflect the complexity and domain specifics of bioinformatics.

BioCoder-Py and BioCoder-Java
For each function that passed all rounds of filtering described in Section 3.1, we manually wrote custom code context, including necessary imports, cross-file dependencies, and relevant fuzz test cases (detailed in Section 3.6). We then created custom prompts based on the parsed function data and summaries, ensuring the inclusion of required imports and cross-file dependencies (see Fig. 4). As we are testing function-level code generation, imports and classes are predefined and included in the context. We are not prompting the model to generate the classes needed to pass the tests, but rather testing its ability to extract pertinent imports and classes from the context for use in the generated function. Table 3 provides prompt statistics. Finally, we presented the model with a prompt to generate the function, offering the function signature as a starting point. Supplementary Appendices B and H contain examples of different prompt types. Prompts were partially generated using GPT-3.5, which was used to create function summaries for all functions in the public dataset. These summaries were incorporated into the prompts to efficiently describe the functions. Supplementary Appendix E provides more details on this method. Figure 3 shows two examples of the resulting prompts.
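A simplified sketch of how such a prompt might be assembled from the parsed pieces is shown below; the layout mirrors the general structure described here and in Fig. 3, but the exact templates (given in Supplementary Appendices B and H) differ in detail, and the function and parameter names are our own.

```python
def build_prompt(imports, global_vars, class_defs, summary, signature, style="summary_at_bottom"):
    """Assemble a code-generation prompt from parsed context (illustrative layout only)."""
    context = "\n".join(imports + global_vars + class_defs)
    if style == "summary_at_top":
        # Summary first, then the full parsed context, then the target signature.
        return f"{summary}\n\n{context}\n\n{signature}\n"
    # Summary at Bottom: context rendered as comments, then the task summary and signature.
    commented_context = "\n".join("# " + line for line in context.splitlines())
    return f"{commented_context}\n\n# {summary}\n{signature}\n"
```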

BioCoder-Rosalind
To compile the Rosalind portion of the benchmark, we began by scraping the problem descriptions from the Rosalind website, identifying problems with available solutions, and gathering all possible solutions.Subsequently, we developed a custom scraper to assemble ten test cases for each Rosalind problem.Using these test cases, we crafted a script to automatically assess whether the available solutions were successfully executed against the collected test cases.
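A minimal sketch of such a validation script is shown below; the command line, timeout, and whitespace-insensitive comparison are assumptions rather than the actual harness. Solutions that pass every case are promoted to golden code, as described next.

```python
import subprocess

def solution_passes(solution_path: str, test_cases, timeout: int = 30) -> bool:
    """Run a candidate Rosalind solution against (input, expected_output) pairs
    and report whether it passes them all (illustrative sketch)."""
    for stdin_text, expected_output in test_cases:
        try:
            result = subprocess.run(
                ["python", solution_path],
                input=stdin_text, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected_output.strip():
            return False
    return True
```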

Solutions that successfully executed against all test cases formed the "golden code" section of the Rosalind benchmark, producing correct outputs when run with the test cases. Each Rosalind benchmark context is custom-made, incorporating the scraped test cases and injecting them into the generated code. The prompts for the Rosalind problems are constructed using the scraped problem descriptions, supplemented with a brief section outlining the context into which the generated code would be integrated. This rigorous filtering process resulted in 253 functions meeting all our criteria. Selected examples for the Rosalind dataset are shown in Supplementary Appendix C. Statistics of token counts, comment lines per function, and parameters per function can be found in Supplementary Appendix A.

Metric
We used the Pass@K metric to measure the functional accuracy (Chen et al. 2021, 2022, Cassano et al. 2023) of code generation models. This metric quantifies, for a certain value K, the probability that the model can solve a particular programming problem when generating K candidate solutions. A problem is deemed "solved" if at least one of the K generated code samples passes all the test cases. Each code sample represents a complete function or program intended to solve the problem. The unbiased numerical estimate of Pass@K for a particular problem is

$$\mathrm{Pass@}K = \mathbb{E}\left[1 - \frac{\binom{n-c}{K}}{\binom{n}{K}}\right],$$

where n is the number of samples generated by the model, c is the number of samples that pass all test cases, and K is the number of samples considered for the Pass@K evaluation (Chen et al. 2021).
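In code, the standard numerically stable form of this estimator (following Chen et al. 2021) looks like the following; the function name and argument order are our own. Averaging this value over all problems gives the benchmark-level Pass@K.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate for one problem: n generated samples, c of which
    pass all test cases, evaluated at K = k (Chen et al. 2021)."""
    if n - c < k:
        return 1.0  # every size-k subset of the samples contains a passing one
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```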

Testing framework
Our testing framework begins with a manual review of selected functions, followed by the creation of a context file and golden code for each problem (see Fig. 4 for an example), as discussed in Section 3.4. Our testing strategy is a hybrid of unit testing and fuzz testing methods, which shares similarities with the metamorphic testing methodology described in Chen et al. (2009). In metamorphic testing, both a reference implementation and the test code are provided with parametrically generated input data to ensure identical behavior. While our approach is not strictly metamorphic testing, it leverages similar principles by using the golden code as a reference and generating random test inputs to compare outputs.
For Python and Java functions, we use a custom syntax in the context file to indicate insertion points for randomly generated test cases, representing four data types: integers, floats, strings, and Boolean values.During runtime, these insertion points are replaced with language-specific code to insert dynamically generated test cases.The tester can be run for any number of iterations, depending on the desired number of fuzz tests.
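The paper does not show the custom insertion-point syntax, so the markers below are hypothetical placeholders; the sketch only illustrates the substitution mechanism for the four supported data types on the Python side.

```python
import random
import string

# Hypothetical markers: the actual BioCoder context files use their own custom syntax.
FUZZ_GENERATORS = {
    "<<FUZZ_INT>>": lambda: str(random.randint(-1000, 1000)),
    "<<FUZZ_FLOAT>>": lambda: repr(random.uniform(-1000.0, 1000.0)),
    "<<FUZZ_STR>>": lambda: repr("".join(random.choices(string.ascii_letters, k=random.randint(1, 20)))),
    "<<FUZZ_BOOL>>": lambda: random.choice(["True", "False"]),
}

def instantiate_fuzz_case(context_template: str) -> str:
    """Replace every insertion marker with a freshly generated literal,
    yielding one concrete fuzz-test iteration of the context file."""
    filled = context_template
    for marker, generate in FUZZ_GENERATORS.items():
        while marker in filled:
            filled = filled.replace(marker, generate(), 1)
    return filled
```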
For Rosalind functions, the process is simpler and more efficient as the functions are less complex.The output of the golden code is generated and cached ahead of time.During testing, the tester executes the generated code within the corresponding context and compares the output with the cached golden code output.
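A sketch of this comparison step might look as follows; the program invocation, timeout, and whitespace-insensitive comparison are assumptions.

```python
import subprocess

def rosalind_case_passes(generated_program: str, test_input: str, cached_golden_output: str) -> bool:
    """Run the generated code (already embedded in its Rosalind context) on one
    test input and compare its stdout with the pre-computed golden output."""
    try:
        result = subprocess.run(
            ["python", "-c", generated_program],
            input=test_input, capture_output=True, text=True, timeout=60,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == cached_golden_output.strip()
```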
We ran the golden output against itself for every fuzz and Rosalind test case to ensure 100% reliability. To ensure system security and test reliability, we ran our tests in Docker environments using Amazon Web Services, coordinating tasks across multiple nodes to accelerate the process without compromising result validity. After creating a generalized Docker image with all necessary Python requirements, we summarized our testing framework in Supplementary Appendix K and addressed potential concerns about testing issues due to package changes in Supplementary Appendix S.
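For orientation, launching one containerized test run could look roughly like the following; the image name, mount path, and entry script are placeholders rather than the actual BioCoder Docker setup.

```python
import subprocess

def run_tests_in_docker(image: str, test_dir: str, iterations: int = 1000) -> str:
    """Launch a single containerized test run over a mounted test directory
    (placeholder image name and entrypoint)."""
    command = [
        "docker", "run", "--rm",
        "-v", f"{test_dir}:/workspace",
        image,
        "python", "/workspace/run_tests.py", "--iterations", str(iterations),
    ]
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stdout
```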
To target specific performance characteristics, we came up with hundreds of variations of the prompt. We chose three goals: test the performance of models with extraneous context, without extraneous context, and with any context. These goals allow us to better analyze failure reasons and the effectiveness of our context-driven approach. After careful experimentation, we settled on the prompt type shown in Fig. 3, which we call Summary at Bottom. Following the instruction paradigm of some considered models, we test a version with the summary moved to the top, along with the text "# Here is an instruction. Complete the function using the required context." To test without extraneous context, we used human annotators to manually determine the required context and used the structure of the Summary at Top prompt. Further prompt explanations can be found in Supplementary Appendix H.
Below is an explanation of the prompt types:

1) Summary Only: These prompts contain only the summary and the function signature, with the uncommented summary coming before the signature. Note that the summary includes nearly complete details about the task; however, it intentionally does not thoroughly explain what the context is. Therefore, this result is best treated as a baseline when compared with other prompt types.

2) Uncommented: These prompts contain the full parsed context (including the imports, global variables, classes, internal class functions, etc.), the summary, and the function signature, in that order. For functions exceeding ten lines in the context, we summarize the parameters, return type, and purpose instead of including the full function code. This step streamlines the number of input tokens and eliminates extraneous data.

3) Summary at Bottom: These prompts have the same structure as the uncommented ones, but we add the context as a comment. In addition, there are no results for "summary at bottom" for Java due to incompatibility with Java syntax; we were unable to generate this type of prompt for Java in a similar manner to how we generated it for Python.

4) Summary at Top: These prompts contain the summary, the full (commented) parsed context, and the function signature, in that order. For Java, the summary is not copied at the bottom. This is intended for models with shorter context lengths, as when we truncated the prompt (usually only affecting the context), the summary would still be intact, along with a portion of the context.

5) Necessary Only: We use a mixture of our syntax-solving algorithm and hand annotation to select precisely which objects within the context are necessary for the function to execute. Note that this is very similar to the environment used for testing the functions.
To accurately represent the performance of the LLM outputs, we implemented basic correction mechanisms to rectify minor syntax and style errors that did not impact functionality. For instance, all StarCoder outputs were appended with a postscript. Each LLM output was then passed through these correction mechanisms before being sent to the testing framework for evaluation (see Tables 5 and 6). Furthermore, to empirically evaluate the hypothesis regarding the efficacy of smaller, specialized LLMs in closed-domain code generation, as opposed to large open-domain pretrained models like GPT-3.5 and GPT-4, we fine-tuned StarCoder and documented the resulting performance. We chose StarCoder as a representative sample of currently popular models. Due to computing constraints, we were unable to fine-tune all the models, but we encourage contributions from the broader community. Inference was executed on HPC clusters equipped with 8× A100 GPUs.
The results in Tables 5 and 6 align with our initial hypothesis, which proposed that larger models would likely outperform their smaller counterparts. However, the significant performance gap between GPT-3.5, GPT-4, and all other code generation models was surprising. This underscores the crucial role of both the dataset size and parameter size of the base models in accomplishing closed-domain code generation prompts. Java performance improved significantly, as the structure is similar between the training set and testing set. Interestingly, despite the rudimentary nature of our fine-tuning on StarCoder, the results still highlighted a significant improvement compared with the non-fine-tuned model. This stark contrast in performance bolsters our original assertion: achieving success in closed-domain tasks can be realized either through large open-domain LLMs or via fine-tuning smaller models. These smaller models could potentially achieve comparable performance but with significantly reduced computational and memory requirements. Furthermore, Table 5 demonstrates that the performance of models improves with the inclusion of dependencies in the prompts.

Figure 1. Overview of the contributions of BioCoder.

Figure 2. A diagram of the BioCoder construction process involving custom GitHub repository cleaning, parsing, and function selection, along with context and test case creation and a massively Dockerized testing framework.

Figure 3. Sample prompts for code generation. Our prompts follow the same general outline: first, imports are declared at the top of the prompt, then global variables (if any), followed by function declarations, class dependencies, and finally, our actual instructions regarding the function to be generated.

Table 1. Comparison of the statistics of BioCoder to previous benchmarks. Test, average number of test cases; P.C., average number of characters in each prompt; P.L., average number of lines in each prompt; C.C., average number of characters in the original code solutions; C.L., average number of lines in the original code solutions. This table is derived from Zan et al. (2023); please refer to Zan et al. (2023) for a more comprehensive survey.

Table 2. Summary statistics for the BioCoder dataset. G.T., ground truth function; Public data, datasets with test cases; Hidden data, encompasses a wider array of intricate issues; Similar data, subset of the hidden data, mimicking the distribution of the public data (Supplementary Appendix T).

Table 4. Context length limits and sizes of different code LLMs.

Figure 4. Test case for the UntangleWorms example. The context file includes various import dependencies and a class definition with a method placeholder for the solution. The UntangleWorms class comes from a GitHub repository file (https://github.com/CellProfiler/CellProfiler/blob/master/cellprofiler/modules/untangleworms.py) that was scraped in our study. UntangleWorms is an image analysis tool that was initially part of the paper "An image analysis toolbox for high-throughput C. elegans assays".